Exome-wide evidence of compound heterozygous effects across common phenotypes in the UK Biobank

Summary The phenotypic impact of compound heterozygous (CH) variation has not been investigated at the population scale. We phased rare variants (MAF ∼0.001%) in the UK Biobank (UKBB) exome-sequencing data to characterize recessive effects in 175,587 individuals across 311 common diseases. A total of 6.5% of individuals carry putatively damaging CH variants, 90% of which are only identifiable upon phasing rare variants (MAF < 0.38%). We identify six recessive gene-trait associations (p < 1.68 × 10−7) after accounting for relatedness, polygenicity, nearby common variants, and rare variant burden. Of these, just one is discovered when considering homozygosity alone. Using longitudinal health records, we additionally identify and replicate a novel association between bi-allelic variation in ATP2C2 and an earlier age at onset of chronic obstructive pulmonary disease (COPD) (p < 3.58 × 10−8). Genetic phase contributes to disease risk for gene-trait pairs: ATP2C2-COPD (p = 0.000238), FLG-asthma (p = 0.00205), and USH2A-visual impairment (p = 0.0084). We demonstrate the power of phasing large-scale genetic cohorts to discover phenome-wide consequences of compound heterozygosity.


Figure S3
: Agreement between read-backed and statistical phase estimation, related to Star Methods.Genetic phase was estimated using WhatsHap (read-backed phasing) and SHAPEIT5 (statistical phasing) in 49,756 individuals across all autosomes.We only carried forward pairs of variants close proximity in which phase could be inferred using WhatsHap.
We combined with statistically phased counterparts derived from SHAPIET5 and determine % disagreement of phase estimation of variant pairs on the -axis, when filtering to phased pairs of variants where the minimum PP >  for  ∈ {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99} according to the color legend.We stratify pairs of variants into bins based on the minimum MAC in the variant pair, on the -axis.Mean disagreement rates are plotted on -axis with whiskers enclosing the 95% binomial CI

CHs only
Figure S8: Simulation of CH and homozygous events in an outbred population, related to Star Methods.We generated genotypes for 1174 genes using allele frequencies derived from observed pLoF variants.For each gene, we simulated genotypes for 176,935 individuals.We averaged the number of bi-allelic variants across 10 simulations.This served as an estimate for the expected count (-axis) of bi-allelic variants, against which we compared the empirically observations (-axis).The first panel (left) is the comparison between observed and simulated homozygous variants.The second panel (right) is the comparison between observed and simulated CH variants.We simulate phenotypic data applied to 100,000 genetically-ascertained NFE on chromosome 22 (Methods) under the liability-threshold model assuming a spike and slab genetic architecture.We assume a 10% disease prevalence and 25% causal genes, and consider varying levels of phenotypic variance explained by these effects ∈ {0, 0.01, 0.02, 0.05, 0.10}.We then apply SAIGE to the simulated phenotypes, testing for an association between presence of a bi-allelic variant in each gene and case status.a) Each panel indicates a set of simulations assuming varying levels of heritability and average effect as labeled in the subtitles.In each panel, we plot the true effect size in the simulation for a given gene on -axis against the corresponding − log 10 () value of association.Areas of circles correspond to the number of samples harboring bi-allelic damaging variants in the 100,000 samples according to the legend.b) To assess the sensitivity and specificity of our approach, we created ROC-AUC curves for each combination of increasing phenotypic variance explained (facet) and increasing average affect (red lines).c) For each ROC-AUC curve from b, we calculate the AUC.White indicates low AUC and red indicates higher AUC.The scatter plot depicts the association P-values both before and after PRS was included as a covariate.The y-axis represents the P-value prior to PRS adjustment, while the x-axis demonstrates the P-value afterPRS adjustment.On the right, the difference in logtransformed P-values before and after PRS adjustment is displayed.The plot highlights gene-trait associations that were considered Bonferroni significant in the recessive analysis.Throughout, we display hazard ratios with (circles) and without (triangles) taking the polygenic contribution into account by conditioning on off-chromosome PRSs for heritable traits that pass our quality control cutoffs.HRs for gene-traits with one or more individuals with multiple cis variants on the same haplotype are also displayed in pink.Associations that pass either Bonferroni significance ( < 1.89 × 10 −7 ) or FDRs < 0.1 cutoff are demarcated by the dashed line in the top and bottom half respectively.Abbreviations: CC (colorectal cancer), COPD (chronic obstructive pulmonary disease).For traits where over 50% of cases are left-censored, the confidence interval estimates cannot be accurately determined using Kaplan-Meier curves, and thus, these should be disregarded.For this reason, wildtype confidence intervals for FLG-Asthma are not displayed in the figure.Wildtype and CH confidence intervals are also not shown for FLG-Dermatitis.

Figure S1 :Figure S2 :
Figure S1: Phased variants retained as a function of phasing confidence score, related to Figure1.Each subplot displays the number of variants retained on the log 10 scale as the PP is increased, split by rarity of variants described in the subplot title.Dotted red and blue lines highlight the number of variants retained after imposing PP cut-offs of 0.9 and 0.99, respectively.

Figure S4 :
Figure S4: Agreement between read-backed phased and statistically phased singleton variants, related to Figure1.Across the 49,756 samples, we identified 80,978 reads with at least one singleton (MAC= 1) variant irrespective of phasing quality.For each PP-bin we determine the agreement between read-backed phased variants and statistically phased variants with red indicating disagreement and blue indicating agreement.With higher PP-score, the proportion of correctly phased variants increases, while the total number of variants assessed decreases.

Figure S6 :Figure S7 :
Figure S6: Distribution of variant annotation categories before and after broad consequence categorization and before and after filtering by PP≥ 0.9, related to Figure1.We annotate variants using VEP and by the most severe consequence in the canonical transcript.Panels (a) and (b) display the total number of unique variants observed across a set of variant consequences colored by degree of predicted impact, before and after broad variant consequence categorization.Panels (c) and (d) depicts the same as above but after restricting to variants with PP≥ 0.9.In each panel, green, orange and red colored bars indicate low, medium and high impact respectively, according to the color legends.Singleton variation within the variant class is stacked and displayed in a lighter shade.Counts of variant within each annotation category are displayed above the bars.Note that all counts shown here are before filtering to accurately phased variants.

Figure S11 :Figure S12 :Figure S13 :Figure S14 :
Figure S11: Distribution of observed variants across samples by allele frequency, related to Star Methods.Histogram of unique bi-allelic variant (CH and homozygotes) prevalence across the allele frequency spectrum.For a qualifying CH variant, the allele frequency corresponding to the alternate allele on the rarest haplotype are plotted.

Figure S16 :Figure S17 :Figure S18 :Figure S19 :
Figure S16: Recessive association analysis without accounting for PRS, related to Figure 2. Recessive Manhattan plot depicting log 10 -transformed gene-trait association P-values versus chromosomal location.Associations are colored red if they are Bonferroni significant ( < 1.68 × 10 −7 ).Any gene-trait association with  < 3.05 × 10 −6 (nominal significance) has been labeled with gene symbol.No additional conditioning was carried out in this analysis.

Figure S20 :Figure S21 :
Figure S20: Overview of attrition for Bonferroni significant associations after successive conditioning steps or filters, related to Star Methods.This bar chart presents the number of Bonferroni significant associations that remain after successive conditioning steps in a genetrait regression model or variant filters.The first bar represents all initial Bonferroni significant associations.The second bar shows the impact of conditioning on off-chromosome PRS.The third bar accounts for nearby common variants, which eliminates two gene-trait pairs.Subsequent bars indicate the effect of further conditioning on rare variants in the gene and an additive model of affected haplotypes, neither of which reduces the number of associations.The last two bars separate the associations those that remain after filtering to compound heterozygous or homozygous variants, respectively.
Figure S22: Kaplan-Meier survival curves for carriers of bi-allelic variants, related to Figure5.Trajectories for wildtypes and bi-allelic (CH or homozygous) carriers of damaging missense/protein-altering mutations are shown with green and black lines respectively.For traits where over 50% of cases are left-censored, the confidence interval estimates cannot be accurately determined using Kaplan-Meier curves, and thus, these should be disregarded.Consequently, wildtype confidence intervals for FLG-Asthma and FLG-Dermatitis are not displayed in the figure.

Figure S23 :
Figure S23: Kaplan-Meier survival curves for carriers of CH, homozygous, heterozygous variants, related to Figure5.Kaplan-Meier survival curves for CH (red), homozygous (orange), heterozygous carriers (blue), single disruption of haplotypes (pink) owed to pLoF or damaging missense/protein-altering mutations.Wildtypes are shown in green.For traits where over 50% of cases are left-censored, the confidence interval estimates cannot be accurately determined using Kaplan-Meier curves, and thus, these should be disregarded.For this reason, wildtype confidence intervals for FLG-Asthma are not displayed in the figure.Wildtype and CH confidence intervals are also not shown for FLG-Dermatitis.

Figure S24 :Figure S25 :
Figure S24: Co-occurrence of deleterious ATP2C2 variants by COPD status, related to Figure 5. Bi-allelic variant occurrence in ATP2C2 for chronic obstructive pulmonary disease (COPD).The constituent variants are shown alongside the variant consequence and involved exon or intron.Each tile indicates that number of individuals are cases out of the total bi-allelic carriers identified.Only the variants that affect both gene copies are shown.Stars (*) are included in the label to indicate homozygosity.

remaining after all filters 9,169,408 57.6 Table S24: Summary of variant filters, related to Star Methods.
Moving down through the rows of the table, we move through QC filtering steps

Agreement between read-backed phasing and statistical phasing, related to Star Methods. We plot the disagreement between
c

Allele frequencies of variants in the CH state, related to Star Methods
. Heatmap of allele counts for variants in CH state stratified by predicted variant consequence (damaging missense, pLoF or pLoF+damaging missense).We plot the MAC for variants residing on the most common haplotype (y-axis) versus the rarest haplotype (x-axis).N Figure S10: Allele frequencies of variants in cis, related to Star Methods.Heatmap of allele counts for co-occurring variants on the same haplotype stratified by predicted variant consequence (damaging missense, pLoF or pLoF+damaging missense).The most common variant on the haplotype versus the least common are plotted.